{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Amazon SageMaker Processing jobs\n",
    "\n",
    "With Amazon SageMaker Processing jobs, you can leverage a simplified, managed experience to run data pre- or post-processing and model evaluation workloads on the Amazon SageMaker platform.\n",
    "\n",
    "A processing job downloads input from Amazon Simple Storage Service (Amazon S3), then uploads outputs to Amazon S3 during or after the processing job.\n",
    "\n",
    "<img src=\"Processing-1.jpg\">\n",
    "\n",
    "This notebook shows how you can:\n",
    "\n",
    "1. Run a processing job to run a scikit-learn script that cleans, pre-processes, performs feature engineering, and splits the input data into train and test sets.\n",
    "2. Run a training job on the pre-processed training data to train a model\n",
    "3. Run a processing job on the pre-processed test data to evaluate the trained model's performance\n",
    "4. Use your own custom container to run processing jobs with your own Python libraries and dependencies."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data pre-processing and feature engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run the scikit-learn preprocessing script as a processing job, create a `SKLearnProcessor`, which lets you run scripts inside of processing jobs using the scikit-learn image provided."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing job name:  amazon-reviews-scikit-processor-2020-03-02-04-59-28\n"
     ]
    }
   ],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "import pandas as pd\n",
    "from sagemaker.sklearn.processing import SKLearnProcessor\n",
    "from time import gmtime, strftime\n",
    "\n",
    "sess   = sagemaker.Session()\n",
    "bucket = sess.default_bucket()\n",
    "role = sagemaker.get_execution_role()\n",
    "region = boto3.Session().region_name\n",
    "\n",
    "#sm = boto3.Session().client(service_name='sagemaker', region_name=region)\n",
    "\n",
    "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "\n",
    "output_prefix = 'amazon-reviews-scikit-processor-{}'.format(timestamp_prefix)\n",
    "processing_job_name = 'amazon-reviews-scikit-processor-{}'.format(timestamp_prefix)\n",
    "\n",
    "print('Processing job name:  {}'.format(processing_job_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before introducing the script you use for data cleaning, pre-processing, and feature engineering, inspect the first 20 rows of the dataset. The target is predicting the `income` category. The features from the dataset you select are `age`, `education`, `major industry code`, `class of worker`, `num persons worked for employer`, `capital gains`, `capital losses`, and `dividends from stocks`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>class of worker</th>\n",
       "      <th>detailed industry recode</th>\n",
       "      <th>detailed occupation recode</th>\n",
       "      <th>education</th>\n",
       "      <th>wage per hour</th>\n",
       "      <th>enroll in edu inst last wk</th>\n",
       "      <th>marital stat</th>\n",
       "      <th>major industry code</th>\n",
       "      <th>major occupation code</th>\n",
       "      <th>...</th>\n",
       "      <th>country of birth father</th>\n",
       "      <th>country of birth mother</th>\n",
       "      <th>country of birth self</th>\n",
       "      <th>citizenship</th>\n",
       "      <th>own business or self employed</th>\n",
       "      <th>fill inc questionnaire for veteran's admin</th>\n",
       "      <th>veterans benefits</th>\n",
       "      <th>weeks worked in year</th>\n",
       "      <th>year</th>\n",
       "      <th>income</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>73</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>High school graduate</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>Widowed</td>\n",
       "      <td>Not in universe or children</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>...</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>Native- Born in the United States</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>95</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>58</td>\n",
       "      <td>Self-employed-not incorporated</td>\n",
       "      <td>4</td>\n",
       "      <td>34</td>\n",
       "      <td>Some college but no degree</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Construction</td>\n",
       "      <td>Precision production craft &amp; repair</td>\n",
       "      <td>...</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>Native- Born in the United States</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>2</td>\n",
       "      <td>52</td>\n",
       "      <td>94</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>18</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>10th grade</td>\n",
       "      <td>0</td>\n",
       "      <td>High school</td>\n",
       "      <td>Never married</td>\n",
       "      <td>Not in universe or children</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>...</td>\n",
       "      <td>Vietnam</td>\n",
       "      <td>Vietnam</td>\n",
       "      <td>Vietnam</td>\n",
       "      <td>Foreign born- Not a citizen of U S</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>95</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>9</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Children</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>Never married</td>\n",
       "      <td>Not in universe or children</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>...</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>Native- Born in the United States</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>94</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>Children</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>Never married</td>\n",
       "      <td>Not in universe or children</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>...</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>Native- Born in the United States</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>94</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>48</td>\n",
       "      <td>Private</td>\n",
       "      <td>40</td>\n",
       "      <td>10</td>\n",
       "      <td>Some college but no degree</td>\n",
       "      <td>1200</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>Married-civilian spouse present</td>\n",
       "      <td>Entertainment</td>\n",
       "      <td>Professional specialty</td>\n",
       "      <td>...</td>\n",
       "      <td>Philippines</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>Native- Born in the United States</td>\n",
       "      <td>2</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>2</td>\n",
       "      <td>52</td>\n",
       "      <td>95</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>42</td>\n",
       "      <td>Private</td>\n",
       "      <td>34</td>\n",
       "      <td>3</td>\n",
       "      <td>Bachelors degree(BA AB BS)</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>Married-civilian spouse present</td>\n",
       "      <td>Finance insurance and real estate</td>\n",
       "      <td>Executive admin and managerial</td>\n",
       "      <td>...</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>Native- Born in the United States</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>2</td>\n",
       "      <td>52</td>\n",
       "      <td>94</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>28</td>\n",
       "      <td>Private</td>\n",
       "      <td>4</td>\n",
       "      <td>40</td>\n",
       "      <td>High school graduate</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>Never married</td>\n",
       "      <td>Construction</td>\n",
       "      <td>Handlers equip cleaners etc</td>\n",
       "      <td>...</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>Native- Born in the United States</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>2</td>\n",
       "      <td>30</td>\n",
       "      <td>95</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>47</td>\n",
       "      <td>Local government</td>\n",
       "      <td>43</td>\n",
       "      <td>26</td>\n",
       "      <td>Some college but no degree</td>\n",
       "      <td>876</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>Married-civilian spouse present</td>\n",
       "      <td>Education</td>\n",
       "      <td>Adm support including clerical</td>\n",
       "      <td>...</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>Native- Born in the United States</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>2</td>\n",
       "      <td>52</td>\n",
       "      <td>95</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>34</td>\n",
       "      <td>Private</td>\n",
       "      <td>4</td>\n",
       "      <td>37</td>\n",
       "      <td>Some college but no degree</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>Married-civilian spouse present</td>\n",
       "      <td>Construction</td>\n",
       "      <td>Machine operators assmblrs &amp; inspctrs</td>\n",
       "      <td>...</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>United-States</td>\n",
       "      <td>Native- Born in the United States</td>\n",
       "      <td>0</td>\n",
       "      <td>Not in universe</td>\n",
       "      <td>2</td>\n",
       "      <td>52</td>\n",
       "      <td>94</td>\n",
       "      <td>- 50000.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows × 42 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   age                  class of worker  detailed industry recode  \\\n",
       "0   73                  Not in universe                         0   \n",
       "1   58   Self-employed-not incorporated                         4   \n",
       "2   18                  Not in universe                         0   \n",
       "3    9                  Not in universe                         0   \n",
       "4   10                  Not in universe                         0   \n",
       "5   48                          Private                        40   \n",
       "6   42                          Private                        34   \n",
       "7   28                          Private                         4   \n",
       "8   47                 Local government                        43   \n",
       "9   34                          Private                         4   \n",
       "\n",
       "   detailed occupation recode                    education  wage per hour  \\\n",
       "0                           0         High school graduate              0   \n",
       "1                          34   Some college but no degree              0   \n",
       "2                           0                   10th grade              0   \n",
       "3                           0                     Children              0   \n",
       "4                           0                     Children              0   \n",
       "5                          10   Some college but no degree           1200   \n",
       "6                           3   Bachelors degree(BA AB BS)              0   \n",
       "7                          40         High school graduate              0   \n",
       "8                          26   Some college but no degree            876   \n",
       "9                          37   Some college but no degree              0   \n",
       "\n",
       "  enroll in edu inst last wk                      marital stat  \\\n",
       "0            Not in universe                           Widowed   \n",
       "1            Not in universe                          Divorced   \n",
       "2                High school                     Never married   \n",
       "3            Not in universe                     Never married   \n",
       "4            Not in universe                     Never married   \n",
       "5            Not in universe   Married-civilian spouse present   \n",
       "6            Not in universe   Married-civilian spouse present   \n",
       "7            Not in universe                     Never married   \n",
       "8            Not in universe   Married-civilian spouse present   \n",
       "9            Not in universe   Married-civilian spouse present   \n",
       "\n",
       "                  major industry code                   major occupation code  \\\n",
       "0         Not in universe or children                         Not in universe   \n",
       "1                        Construction     Precision production craft & repair   \n",
       "2         Not in universe or children                         Not in universe   \n",
       "3         Not in universe or children                         Not in universe   \n",
       "4         Not in universe or children                         Not in universe   \n",
       "5                       Entertainment                  Professional specialty   \n",
       "6   Finance insurance and real estate          Executive admin and managerial   \n",
       "7                        Construction            Handlers equip cleaners etc    \n",
       "8                           Education          Adm support including clerical   \n",
       "9                        Construction   Machine operators assmblrs & inspctrs   \n",
       "\n",
       "   ... country of birth father country of birth mother country of birth self  \\\n",
       "0  ...           United-States           United-States         United-States   \n",
       "1  ...           United-States           United-States         United-States   \n",
       "2  ...                 Vietnam                 Vietnam               Vietnam   \n",
       "3  ...           United-States           United-States         United-States   \n",
       "4  ...           United-States           United-States         United-States   \n",
       "5  ...             Philippines           United-States         United-States   \n",
       "6  ...           United-States           United-States         United-States   \n",
       "7  ...           United-States           United-States         United-States   \n",
       "8  ...           United-States           United-States         United-States   \n",
       "9  ...           United-States           United-States         United-States   \n",
       "\n",
       "                            citizenship own business or self employed  \\\n",
       "0     Native- Born in the United States                             0   \n",
       "1     Native- Born in the United States                             0   \n",
       "2   Foreign born- Not a citizen of U S                              0   \n",
       "3     Native- Born in the United States                             0   \n",
       "4     Native- Born in the United States                             0   \n",
       "5     Native- Born in the United States                             2   \n",
       "6     Native- Born in the United States                             0   \n",
       "7     Native- Born in the United States                             0   \n",
       "8     Native- Born in the United States                             0   \n",
       "9     Native- Born in the United States                             0   \n",
       "\n",
       "  fill inc questionnaire for veteran's admin  veterans benefits  \\\n",
       "0                            Not in universe                  2   \n",
       "1                            Not in universe                  2   \n",
       "2                            Not in universe                  2   \n",
       "3                            Not in universe                  0   \n",
       "4                            Not in universe                  0   \n",
       "5                            Not in universe                  2   \n",
       "6                            Not in universe                  2   \n",
       "7                            Not in universe                  2   \n",
       "8                            Not in universe                  2   \n",
       "9                            Not in universe                  2   \n",
       "\n",
       "   weeks worked in year  year     income  \n",
       "0                     0    95   - 50000.  \n",
       "1                    52    94   - 50000.  \n",
       "2                     0    95   - 50000.  \n",
       "3                     0    94   - 50000.  \n",
       "4                     0    94   - 50000.  \n",
       "5                    52    95   - 50000.  \n",
       "6                    52    94   - 50000.  \n",
       "7                    30    95   - 50000.  \n",
       "8                    52    95   - 50000.  \n",
       "9                    52    94   - 50000.  \n",
       "\n",
       "[10 rows x 42 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "input_data = 's3://sagemaker-sample-data-{}/processing/census/census-income.csv'.format(region)\n",
    "df = pd.read_csv(input_data, nrows=10)\n",
    "df.head(n=10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "sklearn_processor = SKLearnProcessor(framework_version='0.20.0',\n",
    "                                     role=role,\n",
    "                                     instance_type='ml.m5.xlarge',\n",
    "                                     instance_count=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TODO:  Fix this\n",
    "This notebook cell writes a file `preprocessing.py`, which contains the pre-processing script. You can update the script, and rerun this cell to overwrite `preprocessing.py`. You run this as a processing job in the next cell. \n",
    "\n",
    "In this script, you\n",
    "\n",
    "* Remove duplicates and rows with conflicting data\n",
    "* transform the target `income` column into a column containing two labels.\n",
    "* transform the `age` and `num persons worked for employer` numerical columns into categorical features by binning them\n",
    "* scale the continuous `capital gains`, `capital losses`, and `dividends from stocks` so they're suitable for training\n",
    "* encode the `education`, `major industry code`, `class of worker` so they're suitable for training\n",
    "* split the data into training and test datasets, and saves the training features and labels and test features and labels.\n",
    "\n",
    "Our training script will use the pre-processed training features and labels to train a model, and our model evaluation script will use the trained model and pre-processed test features and labels to evaluate the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run this script as a processing job. Use the `SKLearnProcessor.run()` method. You give the `run()` method one `ProcessingInput` where the `source` is the census dataset in Amazon S3, and the `destination` is where the script reads this data from, in this case `/opt/ml/processing/input`. These local paths inside the processing container must begin with `/opt/ml/processing/`.\n",
    "\n",
    "Also give the `run()` method a `ProcessingOutput`, where the `source` is the path the script writes output data to. For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name/`. You also give the ProcessingOutputs values for `output_name`, to make it easier to retrieve these output artifacts after the job is run.\n",
    "\n",
    "The `arguments` parameter in the `run()` method are command-line arguments in our `preprocessing.py` script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Job Name:  sagemaker-scikit-learn-2020-03-02-04-59-39-911\n",
      "Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-sample-data-us-east-1/processing/census/census-income.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-02-04-59-39-911/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]\n",
      "Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-02-04-59-39-911/output/train_data', 'LocalPath': '/opt/ml/processing/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'validation_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-02-04-59-39-911/output/validation_data', 'LocalPath': '/opt/ml/processing/validation', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'test_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-02-04-59-39-911/output/test_data', 'LocalPath': '/opt/ml/processing/test', 'S3UploadMode': 'EndOfJob'}}]\n",
      ".........................\u001b[32m/miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n",
      "  import imp\u001b[0m\n",
      "\u001b[32mReceived arguments Namespace(train_test_split_ratio=0.2)\u001b[0m\n",
      "\u001b[32mReading input data from /opt/ml/processing/input/census-income.csv\u001b[0m\n",
      "\u001b[35m/miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n",
      "  import imp\u001b[0m\n",
      "\u001b[35mReceived arguments Namespace(train_test_split_ratio=0.2)\u001b[0m\n",
      "\u001b[35mReading input data from /opt/ml/processing/input/census-income.csv\u001b[0m\n",
      "\u001b[35mData after cleaning: (68285, 9), 11401 positive examples, 56884 negative examples\u001b[0m\n",
      "\u001b[35mSplitting data into train and test sets with ratio 0.2\u001b[0m\n",
      "\u001b[35mRunning preprocessing and feature engineering transformations\u001b[0m\n",
      "\u001b[35mTrain data shape after preprocessing: (54628, 73)\u001b[0m\n",
      "\u001b[35mTest data shape after preprocessing: (13657, 73)\u001b[0m\n",
      "\u001b[35mSaving training features to /opt/ml/processing/train/train_features.csv\u001b[0m\n",
      "\u001b[32mData after cleaning: (68285, 9), 11401 positive examples, 56884 negative examples\u001b[0m\n",
      "\u001b[32mSplitting data into train and test sets with ratio 0.2\u001b[0m\n",
      "\u001b[32mRunning preprocessing and feature engineering transformations\u001b[0m\n",
      "\u001b[32mTrain data shape after preprocessing: (54628, 73)\u001b[0m\n",
      "\u001b[32mTest data shape after preprocessing: (13657, 73)\u001b[0m\n",
      "\u001b[32mSaving training features to /opt/ml/processing/train/train_features.csv\u001b[0m\n",
      "\u001b[34m/miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses\n",
      "  import imp\u001b[0m\n",
      "\u001b[34mReceived arguments Namespace(train_test_split_ratio=0.2)\u001b[0m\n",
      "\u001b[34mReading input data from /opt/ml/processing/input/census-income.csv\u001b[0m\n",
      "\u001b[34mData after cleaning: (68285, 9), 11401 positive examples, 56884 negative examples\u001b[0m\n",
      "\u001b[34mSplitting data into train and test sets with ratio 0.2\u001b[0m\n",
      "\u001b[34mRunning preprocessing and feature engineering transformations\u001b[0m\n",
      "\u001b[34mTrain data shape after preprocessing: (54628, 73)\u001b[0m\n",
      "\u001b[34mTest data shape after preprocessing: (13657, 73)\u001b[0m\n",
      "\u001b[34mSaving training features to /opt/ml/processing/train/train_features.csv\u001b[0m\n",
      "\u001b[32mSaving test features to /opt/ml/processing/test/test_features.csv\u001b[0m\n",
      "\u001b[34mSaving test features to /opt/ml/processing/test/test_features.csv\u001b[0m\n",
      "\u001b[35mSaving test features to /opt/ml/processing/test/test_features.csv\u001b[0m\n",
      "\u001b[35mSaving training labels to /opt/ml/processing/train/train_labels.csv\u001b[0m\n",
      "\u001b[35mSaving test labels to /opt/ml/processing/test/test_labels.csv\u001b[0m\n",
      "\u001b[32mSaving training labels to /opt/ml/processing/train/train_labels.csv\u001b[0m\n",
      "\u001b[32mSaving test labels to /opt/ml/processing/test/test_labels.csv\u001b[0m\n",
      "\u001b[34mSaving training labels to /opt/ml/processing/train/train_labels.csv\u001b[0m\n",
      "\u001b[34mSaving test labels to /opt/ml/processing/test/test_labels.csv\u001b[0m\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from sagemaker.processing import ProcessingInput, ProcessingOutput\n",
    "\n",
    "sklearn_processor.run(code='preprocessing.py',\n",
    "                      inputs=[ProcessingInput(\n",
    "                        source=input_data,\n",
    "                        destination='/opt/ml/processing/input')],\n",
    "                      outputs=[ProcessingOutput(output_name='train_data',\n",
    "                                                source='/opt/ml/processing/train'),\n",
    "                               ProcessingOutput(output_name='validation_data',\n",
    "                                                source='/opt/ml/processing/validation'),\n",
    "                               ProcessingOutput(output_name='test_data',\n",
    "                                                source='/opt/ml/processing/test')],                      \n",
    "                      arguments=['--train-test-split-ratio', '0.2']\n",
    "                     )\n",
    "\n",
    "preprocessing_job_description = sklearn_processor.jobs[-1].describe()\n",
    "\n",
    "output_config = preprocessing_job_description['ProcessingOutputConfig']\n",
    "for output in output_config['Outputs']:\n",
    "    if output['OutputName'] == 'train_data':\n",
    "        preprocessed_train_data = output['S3Output']['S3Uri']\n",
    "    if output['OutputName'] == 'validation_data':\n",
    "        preprocessed_validation_data = output['S3Output']['S3Uri']\n",
    "    if output['OutputName'] == 'test_data':\n",
    "        preprocessed_test_data = output['S3Output']['S3Uri']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<b>Review <a href=\"https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#logStream:group=/aws/sagemaker/ProcessingJobs;prefix=amazon-reviews-scikit-processor-2020-03-02-04-59-28;streamFilter=typeLogStreamPrefix\">CloudWatch Logs</a> After About 5 Minutes</b>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "display(HTML('<b>Review <a href=\"https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix\">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, processing_job_name)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now inspect the output of the pre-processing job, which consists of the processed features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training features shape: (10, 73)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0.0</th>\n",
       "      <th>0.0.1</th>\n",
       "      <th>0.0.2</th>\n",
       "      <th>0.0.3</th>\n",
       "      <th>0.0.4</th>\n",
       "      <th>1.0</th>\n",
       "      <th>0.0.5</th>\n",
       "      <th>0.0.6</th>\n",
       "      <th>0.0.7</th>\n",
       "      <th>0.0.8</th>\n",
       "      <th>...</th>\n",
       "      <th>0.0.56</th>\n",
       "      <th>0.0.57</th>\n",
       "      <th>0.0.58</th>\n",
       "      <th>0.0.59</th>\n",
       "      <th>0.0.60</th>\n",
       "      <th>1.0.4</th>\n",
       "      <th>0.0.61</th>\n",
       "      <th>0.0.62</th>\n",
       "      <th>0.0.63</th>\n",
       "      <th>0.0.64</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows × 73 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   0.0  0.0.1  0.0.2  0.0.3  0.0.4  1.0  0.0.5  0.0.6  0.0.7  0.0.8  ...  \\\n",
       "0  0.0    0.0    0.0    0.0    0.0  0.0    0.0    0.0    0.0    1.0  ...   \n",
       "1  0.0    0.0    0.0    0.0    0.0  1.0    0.0    0.0    0.0    0.0  ...   \n",
       "2  0.0    0.0    0.0    0.0    0.0  0.0    0.0    0.0    1.0    0.0  ...   \n",
       "3  0.0    0.0    1.0    0.0    0.0  0.0    0.0    0.0    0.0    0.0  ...   \n",
       "4  0.0    0.0    0.0    1.0    0.0  0.0    0.0    0.0    0.0    0.0  ...   \n",
       "5  0.0    0.0    0.0    0.0    0.0  0.0    0.0    1.0    0.0    0.0  ...   \n",
       "6  0.0    0.0    0.0    0.0    0.0  1.0    0.0    0.0    0.0    0.0  ...   \n",
       "7  0.0    0.0    0.0    0.0    0.0  0.0    1.0    0.0    0.0    0.0  ...   \n",
       "8  0.0    0.0    0.0    0.0    0.0  0.0    1.0    0.0    0.0    0.0  ...   \n",
       "9  1.0    0.0    0.0    0.0    0.0  0.0    0.0    0.0    0.0    0.0  ...   \n",
       "\n",
       "   0.0.56  0.0.57  0.0.58  0.0.59  0.0.60  1.0.4  0.0.61  0.0.62  0.0.63  \\\n",
       "0     0.0     0.0     0.0     0.0     1.0    0.0     0.0     0.0     0.0   \n",
       "1     0.0     0.0     0.0     0.0     0.0    1.0     0.0     0.0     0.0   \n",
       "2     0.0     0.0     1.0     0.0     0.0    0.0     0.0     0.0     0.0   \n",
       "3     0.0     0.0     0.0     0.0     0.0    1.0     0.0     0.0     0.0   \n",
       "4     0.0     0.0     0.0     0.0     0.0    0.0     1.0     0.0     0.0   \n",
       "5     0.0     0.0     0.0     0.0     0.0    0.0     1.0     0.0     0.0   \n",
       "6     0.0     0.0     0.0     0.0     0.0    1.0     0.0     0.0     0.0   \n",
       "7     0.0     0.0     0.0     0.0     0.0    1.0     0.0     0.0     0.0   \n",
       "8     0.0     0.0     0.0     0.0     0.0    1.0     0.0     0.0     0.0   \n",
       "9     1.0     0.0     0.0     0.0     0.0    1.0     0.0     0.0     0.0   \n",
       "\n",
       "   0.0.64  \n",
       "0     0.0  \n",
       "1     0.0  \n",
       "2     0.0  \n",
       "3     0.0  \n",
       "4     0.0  \n",
       "5     0.0  \n",
       "6     0.0  \n",
       "7     0.0  \n",
       "8     0.0  \n",
       "9     0.0  \n",
       "\n",
       "[10 rows x 73 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_features = pd.read_csv(preprocessed_train_data + '/train_features.csv', nrows=10)\n",
    "print('Training features shape: {}'.format(train_features.shape))\n",
    "train_features.head(n=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training using the pre-processed data\n",
    "\n",
    "We create a `SKLearn` instance, which we will use to run a training job using the training script `train.py`.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.sklearn.estimator import SKLearn\n",
    "\n",
    "sklearn = SKLearn(\n",
    "    entry_point='train.py',\n",
    "    train_instance_type=\"ml.m5.xlarge\",\n",
    "    role=role)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training script `train.py` trains a logistic regression model on the training data, and saves the model to the `/opt/ml/model` directory, which Amazon SageMaker tars and uploads into a `model.tar.gz` file into S3 at the end of the training job."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the training job using `train.py` on the preprocessed training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sklearn.fit({'train': preprocessed_train_data})\n",
    "train_job_description = sklearn.jobs[-1].describe()\n",
    "model_data_s3_uri = '{}{}/{}'.format(\n",
    "    train_job_description['OutputDataConfig']['S3OutputPath'],\n",
    "    train_job_description['TrainingJobName'],\n",
    "    'output/model.tar.gz')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Evaluation\n",
    "\n",
    "`evaluation.py` is the model evaluation script. Since the script also runs using scikit-learn as a dependency,  run this using the `SKLearnProcessor` you created previously. This script takes the trained model and the test dataset as input, and produces a JSON file containing classification evaluation metrics, including precision, recall, and F1 score for each label, and accuracy and ROC AUC for the model.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from sagemaker.s3 import S3Downloader\n",
    "\n",
    "sklearn_processor.run(code='evaluation.py',\n",
    "                      inputs=[ProcessingInput(\n",
    "                                  source=model_data_s3_uri,\n",
    "                                  destination='/opt/ml/processing/model'),\n",
    "                              ProcessingInput(\n",
    "                                  source=preprocessed_test_data,\n",
    "                                  destination='/opt/ml/processing/test')],\n",
    "                      outputs=[ProcessingOutput(output_name='evaluation',\n",
    "                                  source='/opt/ml/processing/evaluation')]\n",
    "                     )                    \n",
    "evaluation_job_description = sklearn_processor.jobs[-1].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now retrieve the file `evaluation.json` from Amazon S3, which contains the evaluation report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluation_output_config = evaluation_job_description['ProcessingOutputConfig']\n",
    "for output in evaluation_output_config['Outputs']:\n",
    "    if output['OutputName'] == 'evaluation':\n",
    "        evaluation_s3_uri = output['S3Output']['S3Uri'] + '/evaluation.json'\n",
    "        break\n",
    "\n",
    "evaluation_output = S3Downloader.read_file(evaluation_s3_uri)\n",
    "evaluation_output_dict = json.loads(evaluation_output)\n",
    "print(json.dumps(evaluation_output_dict, sort_keys=True, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running processing jobs with your own dependencies\n",
    "\n",
    "Above, you used a processing container that has scikit-learn installed, but you can run your own processing container in your processing job as well, and still provide a script to run within your processing container.\n",
    "\n",
    "Below, you walk through how to create a processing container, and how to use a `ScriptProcessor` to run your own code within a container. Create a scikit-learn container and run a processing job using the same `preprocessing.py` script you used above. You can provide your own dependencies inside this container to run your processing script with."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This block of code builds the container using the `docker` command, creates an Amazon Elastic Container Registry (Amazon ECR) repository, and pushes the image to Amazon ECR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "\n",
    "account_id = boto3.client('sts').get_caller_identity().get('Account')\n",
    "ecr_repository = 'sagemaker-processing-container'\n",
    "tag = ':latest'\n",
    "processing_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)\n",
    "\n",
    "# Create ECR repository and push docker image\n",
    "!docker build -t $ecr_repository docker\n",
    "!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)\n",
    "!aws ecr create-repository --repository-name $ecr_repository\n",
    "!docker tag {ecr_repository + tag} $processing_repository_uri\n",
    "!docker push $processing_repository_uri"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `ScriptProcessor` class lets you run a command inside this container, which you can use to run your own script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.processing import ScriptProcessor\n",
    "\n",
    "script_processor = ScriptProcessor(command=['python3'],\n",
    "                image_uri=processing_repository_uri,\n",
    "                role=role,\n",
    "                instance_count=1,\n",
    "                instance_type='ml.m5.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the same `preprocessing.py` script you ran above, but now, this code is running inside of the Docker container you built in this notebook, not the scikit-learn image maintained by Amazon SageMaker. You can add the dependencies to the Docker image, and run your own pre-processing, feature-engineering, and model evaluation scripts inside of this container."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "script_processor.run(code='preprocessing.py',\n",
    "                      inputs=[ProcessingInput(\n",
    "                        source=input_data,\n",
    "                        destination='/opt/ml/processing/input')],\n",
    "                      outputs=[ProcessingOutput(output_name='train_data',\n",
    "                                                source='/opt/ml/processing/train'),\n",
    "                               ProcessingOutput(output_name='test_data',\n",
    "                                                source='/opt/ml/processing/test')],\n",
    "                      arguments=['--train-test-split-ratio', '0.2']\n",
    "                     )\n",
    "script_processor_job_description = script_processor.jobs[-1].describe()\n",
    "print(script_processor_job_description)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
